Skip to content

Fix KubernetesExecutor leaking a Manager process when reading running task logs#68800

Merged
shahar1 merged 2 commits into
apache:mainfrom
astronomer:k8s-executor-lazy-manager
Jul 1, 2026
Merged

Fix KubernetesExecutor leaking a Manager process when reading running task logs#68800
shahar1 merged 2 commits into
apache:mainfrom
astronomer:k8s-executor-lazy-manager

Conversation

@kaxil

@kaxil kaxil commented Jun 22, 2026

Copy link
Copy Markdown
Member

When the API server serves logs for a RUNNING task instance, FileTaskHandler constructs the configured executor to call get_task_log() and never starts or shuts it down. KubernetesExecutor.__init__ eagerly created a multiprocessing.Manager(), which forks a serve_forever child process. Because the executor instance is cached per process, this leaks one orphaned Manager process (~350-400 MB resident) per API-server worker. With gunicorn/uvicorn worker recycling, each refreshed worker's Manager reparents to PID 1 and is never reaped, so they accumulate and push the API server toward OOM.

get_task_log() only needs the kube client and the pod namespace; it never touches task_queue/result_queue/_manager. The Manager is purely a scheduling-loop resource.

Closes #68693.

Fix

Create the multiprocessing.Manager() and its queues lazily in start() instead of __init__(). The scheduling loop (start/execute_async/sync/end) is their only consumer, and the scheduler always calls start() first. end() now no-ops when the executor was never started. Constructing the executor purely to read logs no longer spawns a Manager.

Design rationale

  • Why start() and not a lazy property? The queues are only used after the scheduler calls start(); nothing touches them between construction and start(). Tying creation to start() keeps the lifecycle explicit and pairs cleanly with end().
  • Consistent with LocalExecutor. LocalExecutor already defers its queue/process creation to start(), so it never leaked on this path. This change brings KubernetesExecutor in line with that pattern, which also covers LocalKubernetesExecutor (its embedded KubernetesExecutor is constructed in __init__ and only started via LocalKubernetesExecutor.start()).

Affected path

Backward compatibility

No public API change. task_queue/result_queue/_manager are None until start() is called; all internal consumers run after start(). Behavior for a started executor (the scheduler) is unchanged.

@boring-cyborg boring-cyborg Bot added area:providers provider:cncf-kubernetes Kubernetes (k8s) provider related issues labels Jun 22, 2026
… task logs

The API server constructs a KubernetesExecutor solely to call get_task_log()
for RUNNING tasks (via FileTaskHandler -> ExecutorLoader.get_default_executor)
and never starts or ends it. KubernetesExecutor.__init__ eagerly created a
multiprocessing.Manager(), which forks a serve_forever child process. Because
that instance is cached per process and never shut down, the Manager child is
orphaned -- one leaked process (~350-400 MB resident) per API-server worker,
growing with worker recycling and pushing the API server toward OOM.

get_task_log() only needs the kube client and pod namespace; it never touches
the task/result queues. Create the Manager and its queues lazily in start()
(the scheduling loop is their only consumer), mirroring how LocalExecutor
already defers process/queue creation. end() now no-ops when the executor was
never started. Constructing the executor for log reading no longer spawns a
Manager.
@kaxil kaxil force-pushed the k8s-executor-lazy-manager branch from a3ebff1 to 1e9a18e Compare June 22, 2026 13:31
@kaxil kaxil requested a review from amoghrajesh June 22, 2026 16:47
@Miretpl

Miretpl commented Jun 28, 2026

Copy link
Copy Markdown
Contributor

I didn't review both PRs in detail, but there's another PR that also fixes the issue: #68697.

@amoghrajesh amoghrajesh left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM but CI needs fixing

@shahar1 shahar1 merged commit 2257e95 into apache:main Jul 1, 2026
102 checks passed
shahar1 added a commit that referenced this pull request Jul 1, 2026
…69216)

The added tests carried a multi-line docstring and inline comments that
restated what the test names and assertions already convey. Reducing them
to the minimum keeps the intent while dropping the narration, following up
on review feedback for #68800.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:providers provider:cncf-kubernetes Kubernetes (k8s) provider related issues

Projects

None yet

Development

Successfully merging this pull request may close these issues.

API server leaks a KubernetesExecutor multiprocessing.Manager process per worker when viewing RUNNING task logs

5 participants